
Conversation

Contributor

@githubnemo githubnemo commented Oct 23, 2025

Support for gradient checkpointing was lost in the major refactoring in PR #38635; this PR is an attempt to re-add it.

I extended the tests to:

  • test use_reentrant=True and False
  • make sure model.train is called so that gradient checkpointing works; this is a limitation of the tests currently used by GPTBigCode
  • make sure that one (the first) gradient checkpointing layer is called
  • make sure that the same non-zero gradients are present for the normal and checkpointing runs - this is something we tripped over before in PEFT due to the possibly incompletely stored runtime environment in the checkpointed forward step, see also peft#2826

Note that the invocation of GPTBigCodeBlock.forward has changed (see the sketch below):

  • layer_past is now passed as a keyword argument so that GradientCheckpointingLayer.__call__ can see and filter this parameter (use_reentrant=False fails otherwise)
  • {encoder_}hidden_states are still passed as positional arguments so that torch.utils.checkpoint.checkpoint receives them as positional args and computes gradients for them (kwargs would be filtered by GradientCheckpointingLayer).

🚨 Note that this is breaking compatibility by changing the forward signature in GPTBigCodeBlock.forward!
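
For illustration, here is a minimal, self-contained sketch of the call convention this relies on. `ToyCheckpointingBlock` and `checkpointed_call` are made-up names (this is not the actual `GradientCheckpointingLayer` implementation): tensors that should receive gradients are passed positionally into `torch.utils.checkpoint.checkpoint`, while the cache-like `layer_past` keyword is filtered out before checkpointing.

```python
# Hypothetical sketch, not transformers code: ToyCheckpointingBlock and
# checkpointed_call are made-up names used only to illustrate the convention.
import torch
from torch.utils.checkpoint import checkpoint


class ToyCheckpointingBlock(torch.nn.Module):
    def __init__(self, hidden_size: int = 8):
        super().__init__()
        self.linear = torch.nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states, encoder_hidden_states=None, layer_past=None):
        out = self.linear(hidden_states)
        if encoder_hidden_states is not None:
            out = out + self.linear(encoder_hidden_states)
        return out

    def checkpointed_call(self, *args, **kwargs):
        # Drop cache-like state before checkpointing (caches don't mix well with
        # recomputation); tensors that need gradients stay positional so that
        # checkpoint() tracks them and computes their gradients.
        kwargs.pop("layer_past", None)
        return checkpoint(self.forward, *args, use_reentrant=False, **kwargs)


block = ToyCheckpointingBlock()
hidden = torch.randn(2, 8, requires_grad=True)
encoder_hidden = torch.randn(2, 8, requires_grad=True)

# hidden_states / encoder_hidden_states positional, layer_past as a keyword:
out = block.checkpointed_call(hidden, encoder_hidden, layer_past=None)
out.sum().backward()
assert hidden.grad is not None and encoder_hidden.grad is not None
```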

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor

@vasqu vasqu left a comment

The tests are neat; I think we should move them to the common tests though. Not exactly sure why this was specially treated here.

And I guess there will be a need for another round to check similar models where the checkpointing layer may have been accidentally overridden 😓 not necessarily in this PR though

Comment on lines +295 to -296
+        encoder_hidden_states: Optional[torch.Tensor] = None,
         layer_past: Optional[Cache] = None,
         attention_mask: Optional[torch.Tensor] = None,
-        encoder_hidden_states: Optional[torch.Tensor] = None,
Contributor

Let's not change the order here, we could break things for users. Rather change the args/kwargs positions on the module call if necessary

Contributor Author

I'm not sure that this is possible. It is mandatory that we pass layer_past as a keyword argument, otherwise GradientCheckpointingLayer will not be able to remove it from the kwargs in case of gradient checkpointing. On the other hand, every input that may require gradients (hidden_states, encoder_hidden_states) must be passed as a positional argument for checkpoint() to work. Maybe I'm missing something, but I don't think we can reconcile those without moving encoder_hidden_states up in the list.
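
As a stand-alone illustration of this constraint (a toy function, not transformers code), `torch.utils.checkpoint.checkpoint` with `use_reentrant=True` only tracks gradients through positional tensor arguments and rejects keyword arguments for the wrapped function outright:

```python
# Toy example (not transformers code) of the torch.utils.checkpoint behavior
# behind this constraint.
import torch
from torch.utils.checkpoint import checkpoint


def toy_block(hidden_states, encoder_hidden_states):
    return hidden_states * 2 + encoder_hidden_states * 3


h = torch.randn(4, requires_grad=True)
enc = torch.randn(4, requires_grad=True)

# Positional tensors: both receive gradients through the checkpointed call.
out = checkpoint(toy_block, h, enc, use_reentrant=True)
out.sum().backward()
assert h.grad is not None and enc.grad is not None

# Passing the tensor as a keyword instead, e.g.
#   checkpoint(toy_block, h, encoder_hidden_states=enc, use_reentrant=True)
# raises a ValueError, since reentrant checkpointing does not accept keyword
# arguments for the wrapped function.
```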

Contributor

I mean that the signature should stay the same, e.g. see

def forward(
    self,
    hidden_states: Optional[tuple[torch.Tensor]],
    layer_past: Optional[torch.Tensor] = None,
    attention_mask: Optional[torch.Tensor] = None,
    head_mask: Optional[torch.Tensor] = None,
    encoder_hidden_states: Optional[torch.Tensor] = None,
    encoder_attention_mask: Optional[torch.Tensor] = None,
    use_cache: Optional[bool] = False,
    output_attentions: Optional[bool] = False,
    **kwargs,
):

The calls from the module above will need to be adjusted accordingly, like

outputs = block(
    hidden_states,
    layer_past,
    attention_mask,
    head_mask[i],
    encoder_hidden_states,  # as a positional argument for gradient checkpointing
    encoder_attention_mask=encoder_attention_mask,
    use_cache=use_cache,
    output_attentions=output_attentions,
)

Changing the signature is breaking a bit too much!

Contributor

For visibility: as discussed internally, we need this to be breaking

@githubnemo githubnemo changed the title Implement gradient checkpointing in GPTBigCode 🚨 Implement gradient checkpointing in GPTBigCode Oct 27, 2025
@vasqu
Contributor

vasqu commented Oct 27, 2025

cc @ArthurZucker since this might become a bit more breaking than initially thought, and it likely affects more models

@githubnemo
Contributor Author

I've updated the general tests. From the commit message:

- Compare that the non-zero gradients in a reference run are present in the checkpointing run
- Make sure that the forward of at least one gradient checkpointing layer is actually called
  more than once (as expected during gradient checkpointing backward)

Currently there are some problems with Bert-derived MultipleChoice models: when dropout is
enabled there are scenarios during gradient checkpointing where `classifier.bias.grad` is None.
I don't have a good explanation for this yet; disabling dropout resolves it. I would have
understood it if it were dropout on the classification layer, but enabling attention dropout
also leads to this behavior.

MoE models have selective sparsity depending on the selected experts; for this reason we
only compare gradients on parameters collected during the reference backward run.

Currently these models are expected to fail since they don't implement GradientCheckpointingLayer:

  • swiftformer
  • xlstm
  • zamba
  • zamba2

most likely these as well:

  • janus (no training testing?)
  • clvp

As I explained in the commit message, there's a strange bug with Bert-derived models when testing the BertForMultipleChoice case. When attention(!) dropout is active, the classification bias sometimes receives a None gradient. I don't have a good explanation for this right now, but it seems fishy. Happy about any input.

I didn't revert the GPTBigCode test changes yet since I first wanted to get an opinion on whether we want to proceed with these more general tests or not.
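
For context, a rough sketch of the gradient comparison described in the commit message above. The helper names are hypothetical and this is not the actual common-test code; it assumes a transformers `PreTrainedModel`-style `model` with `gradient_checkpointing_enable`/`gradient_checkpointing_disable`:

```python
import torch


def collect_nonzero_grads(model):
    # Gather gradients that are present and non-zero after a backward pass.
    return {
        name: p.grad.clone()
        for name, p in model.named_parameters()
        if p.grad is not None and p.grad.abs().sum() > 0
    }


def compare_checkpointing_grads(model, inputs, loss_fn):
    model.train()  # gradient checkpointing is only active in training mode

    # Reference run without checkpointing.
    model.gradient_checkpointing_disable()
    model.zero_grad()
    loss_fn(model(**inputs)).backward()
    reference = collect_nonzero_grads(model)

    # Checkpointing run (the tests cover use_reentrant=True and False).
    model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
    model.zero_grad()
    loss_fn(model(**inputs)).backward()
    checkpointed = collect_nonzero_grads(model)

    # Every parameter with a non-zero gradient in the reference run must get
    # the same gradient with checkpointing enabled.
    for name, ref_grad in reference.items():
        torch.testing.assert_close(checkpointed[name], ref_grad)
```

Comparing only the gradients collected in the reference run is what keeps MoE models with unselected experts from producing false failures.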

Contributor

@vasqu vasqu left a comment

I think this is fine; we will likely need to break other models' signatures as well, i.e. not only GPTBigCode. This PR will get bigger than initially thought, but let's fix these models

We can allow this for v5 but let's also mention this PR in the v5 thread (#40822) when we merge this.

Comment on lines 1169 to 1175
# TODO I don't understand why attention_probs_dropout_prob influences classifier.bias in
# BertForMultipleChoice (and other Bert derived models). Sometimes classifier.bias is None
# when attention_probs_dropout_prob > 0. This might indicate a bug somewhere.
if hasattr(config, "hidden_dropout_prob"):
    config.hidden_dropout_prob = 0.0
if hasattr(config, "attention_probs_dropout_prob"):
    config.attention_probs_dropout_prob = 0.0
Contributor

This is only for the multiple choice class? Or are other model types also affected?

I don't think these have high usage either way, so it's fine if we leave an explanation here

Contributor Author

This is only for the multiple choice class? Or are other model types also affected?

It didn't seem to make a difference for other models with the limited test runs I made. If you want I can limit it to the problematic model class.

Contributor

I think it's fine with the comment, but it would be nice if you could double-check whether it may affect other model (classes)

Comment on lines 832 to 844
# Gradient checkpointing is implemented via GradientCheckpointingLayer, if none is present this is likely
# an implementation issue. Note we exclude xlstm and zamba* for now since they are still not using
# GradientCheckpointingLayer.
if config.model_type not in [
    "xlstm",
    "zamba",
    "zamba2",
    "swiftformer",
    "janus_vqgan",
    "clvp_encoder",
    "clvp_decoder",
]:
    self.assertTrue([m for m in model.modules() if isinstance(m, GradientCheckpointingLayer)])
Contributor

As discussed, let's fix these within this PR directly. It seems like these were more unintentional regressions due to other big PRs

nemo added 2 commits October 30, 2025 13:10
also drop janus from the ignore list - only the VQVAE case is without
gradient checkpointing and it is doubtful that it is useful in that
case. Training with gradient checkpointing is not tested anyway.
Contributor

@vasqu vasqu left a comment

Just noticed this small thing in xlstm

Re: Clvp, let's isolate it for now. We can come back to it later unless you have a good idea how to refactor/handle this properly

The implementation of GradientCheckpointingLayers is not trivial and may break behavior
that was previously expected. Therefore we keep it as-is for now.
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: gpt_bigcode, swiftformer, xlstm, zamba, zamba2
